Hello everyone. Welcome back to the Heterogeneous Parallel Programming class. This is lecture 1.5, Introduction to CUDA; we are in the memory allocation and data movement API functions part.

The objective of this lecture is to help you learn the basic application programming interface functions, or API functions, in CUDA host code. API functions are a standard way for industry to extend standard programming languages to support certain specialized functionalities. In this case, the CUDA designers at NVIDIA provided these API functions to help C programmers use throughput-oriented devices, such as GPUs, in a heterogeneous computing system. The two types of API functions that we will be looking at today are the device memory allocation functions and the host-device data transfer functions.

This slide reviews the vector addition example. We used this example to illustrate data parallelism, and today we are going to use it to show how easily you can convert a standard sequential C program for vector addition into a heterogeneous parallel piece of code with the same functionality. Just as a reminder, each CUDA thread will be adding one element of A and one element of B and assigning the sum to the corresponding element of C.

Here we look at traditional C code for vector addition. In the main function, we do the memory allocation in C, then we do some I/O to read in A and B, and then we need to determine the number of elements in A and B, which we call N. At some point, we want to do the vector addition, so we call the vecAdd function with four parameters: the pointers to A, B, and C, and the number of elements in these vectors. Up here, we show a simple sequential function for vector addition in C. It matches the parameters that our main function is going to pass to it: three pointers to floating-point arrays A, B, and C, and an integer value that gives the number of elements in A, B, and C. Then there is a for loop with a loop variable i, and this for loop sequentially goes through all the elements of A and B, adds them up, and assigns the sum to the corresponding element of C. This is sequential code, because the loop sequentially visits all the A and B elements and generates the corresponding C elements.
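As a concrete reference, here is a minimal sketch of that sequential version. The vecAdd body follows the lecture's description; the main function is illustrative, with a fixed n and an initialization loop standing in for the I/O reads, which the lecture does not show.

#include <stdlib.h>

/* Sequential vector addition: the loop visits every element of A and B,
   adds them, and assigns the sum to the corresponding element of C. */
void vecAdd(float *A, float *B, float *C, int n)
{
    for (int i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}

int main(void)
{
    int n = 1024;  /* illustrative; in the lecture, n comes from I/O */
    float *h_A = (float *) malloc(n * sizeof(float));
    float *h_B = (float *) malloc(n * sizeof(float));
    float *h_C = (float *) malloc(n * sizeof(float));

    for (int i = 0; i < n; i++) {  /* stand-in for reading A and B */
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    vecAdd(h_A, h_B, h_C, n);

    free(h_A); free(h_B); free(h_C);
    return 0;
}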
Now we are going to show how we can systematically convert this piece of code into parallel CUDA code.

This slide shows the outline of how we can change the vecAdd function to use a throughput-oriented GPU device. Instead of performing the actual computation itself, this function is going to call a kernel function that will be executed on the device. Before calling that kernel, the function also needs to do some outsourcing activity: it needs to copy data from the host memory into the device memory, so that the device is ready to process the data. Eventually, after the device completes its computation, it needs to copy the C vector back into the host memory.

When we look at the function, there is also a header file that you will need to include in order for this function to work properly. This is the include file cuda.h. This is a line that you need to add to your CUDA files in order for the host code to get access to all the API functions properly.

Here we show the three main parts of the host code. The first part is to allocate device memory for A, B, and C, and copy A and B to the device memory. This is illustrated in the top picture, where part one copies data from the host memory to the device memory after we have allocated space in the device memory. The second part is for the host code to launch the kernel function; this will be the topic of the next lecture. And after the kernel completes its execution, part three of the host code will copy C, the result of the computation, back from the device memory to the host memory. This is illustrated as part three in the picture.
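In code, the outline might look like the following skeleton, assuming the same parameters as the sequential version (the h_ prefix marks host pointers, a convention the lecture uses shortly). The bodies of the three parts are left as comments for now.

#include <cuda.h>

/* Outline of the revised vecAdd: a skeleton only, following the
   three parts described above. */
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    /* Part 1: allocate device memory for A, B, and C;
       copy A and B from host memory to device memory. */

    /* Part 2: launch the kernel that performs the actual
       vector addition on the device (next lecture). */

    /* Part 3: copy the result C from device memory back to
       host memory; free the device memory. */
}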
In order to really understand what is going on with these API functions, you need a good conceptual understanding of the CUDA memories. This picture is actually a simplified one; it does not show all of the CUDA device memory types. It shows the two important parts that you will need in order to understand the API functions and, immediately in the next lecture, a simple piece of kernel code: the registers and the global memory.

The simple way of looking at this is that each device will have many, many threads. Remember, these threads are actually virtualized von Neumann processors, so you can think of the threads as processors. Each processor has a set of registers, and these registers hold variables that are private to the thread. All the threads also have access to a shared global memory. This is going to be important as well: in the kernel, we will see that some of the accesses go to variables in the shared global memory.

For the purposes of this lecture, the more important part is that the host code can allocate memory in the global memory, and can also request data copies from the host memory to the global memory, and vice versa, that is, from the global memory back to the host memory. We will cover more memory types in subsequent lectures, when we talk about locality and so on.

The first type of API functions that we are going to focus on is the CUDA device memory management API functions, mainly cudaMalloc and cudaFree. The cudaMalloc function allocates objects in the device global memory. It takes two parameters: one is the address of a pointer to the allocated object, and the other is the size of the allocated object in bytes. For C programmers, this is pretty much the same as the malloc function, because malloc in C also requires the size of the allocated object in bytes. The difference that you will notice as a C programmer is that C's malloc returns a pointer value to the allocated object, whereas here we pass the address of a pointer, and the allocation function writes the address of the allocated object into that pointer. So this is really a call-by-reference activity. The reason it is different is that all the CUDA API functions return an error code, which we will see in a few slides. Because the return value is always the error code, the only way the cudaMalloc function can systematically return a pointer to the allocated object is through this call-by-reference convention.

The second function is cudaFree. cudaFree frees the object from the global memory so that the memory space can be recycled, and it takes only one parameter, which is a pointer to the freed object. Be very careful here: the parameter to cudaFree is the pointer to the freed object, whereas the first parameter to cudaMalloc is the address of the pointer to the allocated object.
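As a small sketch of this convention difference (using the d_ prefix for device pointers, as the lecture does shortly), note the & in the cudaMalloc call versus the plain pointer in the cudaFree call:

#include <cuda.h>

/* A sketch contrasting the two parameter conventions described above. */
void allocFreeExample(int n)
{
    float *d_A;
    int size = n * sizeof(float);

    /* address of the pointer: cudaMalloc writes the device address into d_A */
    cudaMalloc((void **) &d_A, size);

    /* ... use d_A in kernels ... */

    /* the pointer itself, not its address */
    cudaFree(d_A);
}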
The second category of API functions that we will be using today is the host-device data transfer API functions, mainly the cudaMemcpy function. cudaMemcpy is fashioned after the C memcpy function. It performs memory data transfer, and it requires four parameters: the first parameter is a pointer to the destination, the second is a pointer to the source, the third is the number of bytes to be copied, and the fourth is the type, or direction, of the transfer. We typically use predefined constants in CUDA to indicate this type, as we will see in the next slide. The transfer to the device by this function is asynchronous, meaning that we can request one copy by calling cudaMemcpy, and cudaMemcpy will return right away, even before the copy is complete, so that we can immediately request another copy. This is actually very important when we begin to utilize task-level parallelism, and we will come back to this point later in the course.
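As a sketch of the call shape, using the h_A, d_A, h_C, d_C, and size names from the upcoming slide:

#include <cuda.h>

/* The shape of the two transfer directions: destination first,
   then source, then byte count, then the direction constant. */
void transferExample(float *h_A, float *d_A, float *h_C, float *d_C, int size)
{
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);  /* host to device */
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);  /* device to host */
}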
Now that we have introduced cudaMalloc, cudaFree, and cudaMemcpy, we are ready to convert our vector addition host code into the real host code. This function is no longer just an outline; we actually have all the statements that implement those parts, except part two.

In part one, we now have declarations of d_A, d_B, and d_C. These are pointers to the objects allocated in the device memory. The first cudaMalloc call allocates the device memory for vector A. As you can see, the size calculation is n times the size of a float. Each float in CUDA is four bytes, and n is the number of elements in the vector, so this gives us the size in bytes. Then we have a cudaMemcpy call. The destination is in the device memory, which is why it is d_A; the source is in the host memory, which is why it is h_A. The size gives the number of bytes, and then there is the predefined constant cudaMemcpyHostToDevice. This constant is defined in the cuda.h file that you included in your source file. Once we have allocated the memory, we can go ahead and do the cudaMemcpy, and we can do the same thing for vector B: we allocate B and copy B from the host memory to the device memory. We also allocate memory for C, but we do not need to copy C from the host, because C is the result of the computation; the kernel is going to generate all the values of C.

Part two remains a comment here, because we are going to come back in the next lecture to complete this part. Part three then copies the result from the device memory back into the host memory, and we use the constant cudaMemcpyDeviceToHost to indicate the direction of the copy. After we are done, we can just go ahead and free A, B, and C from the device.

So this piece of code gives you all the code that you need to allocate memory and copy data in preparation for the kernel execution, and then to copy the result data back and free up the memory.
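Putting the pieces together, the completed host code might look like the following sketch; part two is still a placeholder, and error checking is omitted here, as discussed on the next slides.

#include <cuda.h>

void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    /* Part 1: allocate device memory and copy the inputs over. */
    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_C, size);  /* no copy: the kernel produces C */

    /* Part 2: kernel launch goes here (next lecture). */

    /* Part 3: copy the result back and free the device memory. */
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}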
In general, when we actually try to get performance out of this kind of code, we cannot afford to copy data back and forth before and after each kernel invocation. For real applications, we tend to keep the data resident in the device memory, and then we just keep launching kernels to perform computation on that device memory. But because this is the beginning of the course, we are showing a very simple example with all the pieces that can be involved, so that you know exactly how to allocate memory, how to copy data from host to device, and how to copy results from device to host. In a real application, some of these steps may not be necessary, because the data may already be residing in the device memory, or, in some cases, the result can stay in the device memory for future use and does not have to be copied back.

In practice, what we have shown so far is that we just go ahead and call the cudaMalloc function and assume that does the job. However, in practice I would like to encourage you to always check for error conditions. Here is what you really should do when you call cudaMalloc. You should declare a variable of the type cudaError_t. This is a predefined type in the CUDA API, also from the cuda.h file. You declare a variable, in this case called err, so that when we call cudaMalloc, we take the return value and assign it to the err variable. This error code can then be checked against cudaSuccess. Whenever the error code is cudaSuccess, it means that the function has completed what you asked for; in this case, cudaMalloc has successfully allocated the requested amount of memory and assigned the pointer to the allocated object to d_A. However, if the error code is not cudaSuccess, then you need to figure out what went wrong. In most cases, the reason there is an error condition in cudaMalloc is that there is not enough device memory to satisfy the allocation request. A good way to bring out the error message is to call the cudaGetErrorString API function. This function is also provided as part of the CUDA API, and it converts the error code into a string that is human readable. Then, just like with standard C functions, you can use __FILE__ and __LINE__ to print out the position where the error happened; this gives you the line position of the printf statement in your error check. Then you can exit the function with the EXIT_FAILURE code, so that instead of continuing to execute the code after the error has already happened, you can exit, debug your function, and see why there is not sufficient memory to satisfy the cudaMalloc request.
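Put together, the error-checking pattern might look like this sketch, applied to the first allocation of the running example:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

/* The error-checking pattern described above (n is illustrative). */
void checkedAllocExample(int n)
{
    int size = n * sizeof(float);
    float *d_A;

    cudaError_t err = cudaMalloc((void **) &d_A, size);
    if (err != cudaSuccess) {
        /* human-readable message, plus the file and line of this check */
        printf("%s in %s at line %d\n",
               cudaGetErrorString(err), __FILE__, __LINE__);
        exit(EXIT_FAILURE);
    }

    cudaFree(d_A);
}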
In the future slides, I am still going to be showing cudaMalloc and cudaMemcpy calls without this error-code checking, because that keeps the slides simple. But when you do the lab assignments, I would really like to encourage you to use this kind of error-checking sequence. Even though it makes your code a lot bigger, in the long run it will save you a lot of time and a lot of stress, because it will help you catch these errors, and it will be a lot easier for you to debug your program.

So, now we have completed a very quick introduction to the CUDA API functions that help you do memory allocation and data transfer, and we also showed the error-reporting string API function. If you are interested in understanding more about the CUDA API functions, please read chapter three of the textbook. Thank you.